[Linkpost] “Q2 AI Benchmark Results: Pros Maintain Clear Lead” by Benjamin Wilson 🔸, johnbash, Metaculus

Update: 2025-10-28

Description

This is a link post.

By Ben Wilson and John Bash from Metaculus

Main Takeaways

Top Findings

Pro forecasters significantly outperform bots: Our team of 10 Metaculus Pro Forecasters demonstrated superior performance compared to the top-10 bot team, with strong statistical significance (p = 0.00001) based on a one-sided t-test on Peer scores.
The bot team did not improve significantly in Q2 relative to the human Pro team: The bot team's head-to-head score against Pros was -11.3 in Q3 2024 (95% CI: [-21.8, -0.7]), then -8.9 in Q4 2024 (95% CI: [-18.8, 1]), then -17.7 in Q1 2025 (95% CI: [-28.3, -7.0]), and now -20.03 in Q2 2025 (95% CI: [-28.63, -11.41]), with no clear trend emerging. (Reminder: a lower head-to-head score indicates worse relative accuracy. A score of 0 corresponds to equal accuracy.)
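
To make the head-to-head numbers above easier to read, here is a minimal sketch of how such a score and its confidence interval could be computed. It rests on assumptions that this summary does not state: that the per-question score on a binary question takes the log-ratio form 100 * ln(p_bot / p_pro) for the outcome that occurred, and that a large-sample normal approximation is good enough for the interval (the post itself reports a t-test, and its intervals may be computed differently, for example with question weights). The function names and the forecasts are hypothetical and for illustration only.

```python
import math
from statistics import NormalDist, mean, stdev


def head_to_head(p_bot: float, p_pro: float, resolved_yes: bool) -> float:
    """Assumed log-ratio score for one binary question.

    Negative when the bot put less probability than the Pros on the
    outcome that actually occurred; 0 when both forecasts are equal.
    Probabilities must lie strictly between 0 and 1.
    """
    if not resolved_yes:
        # Score the probability assigned to the "No" outcome instead.
        p_bot, p_pro = 1.0 - p_bot, 1.0 - p_pro
    return 100.0 * math.log(p_bot / p_pro)


def mean_with_ci(scores, level=0.95):
    """Mean score and a normal-approximation confidence interval."""
    m = mean(scores)
    se = stdev(scores) / math.sqrt(len(scores))  # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    return m, (m - z * se, m + z * se)


# Hypothetical forecasts: (bot probability, Pro probability, resolved Yes?)
questions = [
    (0.60, 0.80, True),
    (0.30, 0.10, False),
    (0.50, 0.50, True),
    (0.90, 0.70, True),
]
scores = [head_to_head(b, p, r) for b, p, r in questions]
avg, (lo, hi) = mean_with_ci(scores)
print(f"mean head-to-head: {avg:.1f}, 95% CI: [{lo:.1f}, {hi:.1f}]")
```

Under these assumptions, an interval that lies entirely below 0 (as in Q3 2024, Q1 2025, and Q2 2025) suggests the bot team was less accurate than the Pros, while an interval that includes 0 (as in Q4 2024) does not rule out equal accuracy.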

Other Takeaways

This quarter's winning bot is open-source: Q2 Winner Panshul has very generously made his bot open-source. The bot writes separate “outside view” and “inside view” [...]

---

Outline:

(00:20) Main Takeaways

(03:24) Introduction

(04:30) Methodology

(13:59) How do LLMs Compare?

(17:18) Which Bot Strategy is Best?

(23:04) Are Bots Better than Human Pros?

(25:38) Binary vs Numeric vs Multiple Choice Questions

(27:07) Team Performance Over Quarters

(31:14) Bot Maker Survey

(31:40) Best practices of the best-performing bots

(38:27) Other Survey Results

(41:32) How did scaffolding do?

(45:33) Advice from Bot Makers

(53:48) Links to Code and Data

(54:56) Future AI Benchmarking Tournaments

---

First published:

October 28th, 2025

Source:

https://forum.effectivealtruism.org/posts/F2stjK9wHSy3HPEC9/q2-ai-benchmark-results-pros-maintain-clear-lead

Linkpost URL:
https://www.metaculus.com/notebooks/40456/q2-ai-benchmark-results/

---

Narrated by TYPE III AUDIO.


